{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--BOOK_INFORMATION-->\n",
    "<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
    "*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
    "\n",
    "*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The previous sections outline the fundamental ideas of machine learning, but all of the examples assume that you have numerical data in a tidy, ``[n_samples, n_features]`` format.\n",
    "In the real world, data rarely comes in such a form.\n",
    "With this in mind, one of the more important steps in using machine learning in practice is *feature engineering*: that is, taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.\n",
    "\n",
    "In this section, we will cover a few common examples of feature engineering tasks: features for representing *categorical data*, features for representing *text*, and features for representing *images*.\n",
    "Additionally, we will discuss *derived features* for increasing model complexity and *imputation* of missing data.\n",
    "Often this process is known as *vectorization*, as it involves converting arbitrary data into well-behaved vectors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Categorical Features\n",
    "\n",
    "One common type of non-numerical data is *categorical* data.\n",
    "For example, imagine you are exploring some data on housing prices, and along with numerical features like \"price\" and \"rooms\", you also have \"neighborhood\" information.\n",
    "Your data might look something like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = [\n",
    "    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},\n",
    "    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},\n",
    "    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},\n",
    "    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might be tempted to encode this data with a straightforward numerical mapping:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "{'Queen Anne': 1, 'Fremont': 2, 'Wallingford': 3};"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It turns out that this is not generally a useful approach in Scikit-Learn: the package's models make the fundamental assumption that numerical features reflect algebraic quantities.\n",
    "Thus such a mapping would imply, for example, that *Queen Anne < Fremont < Wallingford*, or even that *Wallingford - Queen Anne = Fremont*, which (niche demographic jokes aside) does not make much sense.\n",
    "\n",
    "In this case, one proven technique is to use *one-hot encoding*, which effectively creates extra columns indicating the presence or absence of a category with a value of 1 or 0, respectively.\n",
    "When your data comes as a list of dictionaries, Scikit-Learn's ``DictVectorizer`` will do this for you:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[     0,      1,      0, 850000,      4],\n",
       "       [     1,      0,      0, 700000,      3],\n",
       "       [     0,      0,      1, 650000,      3],\n",
       "       [     1,      0,      0, 600000,      2]], dtype=int32)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "vec = DictVectorizer(sparse=False, dtype=int)\n",
    "vec.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the 'neighborhood' column has been expanded into three separate columns, representing the three neighborhood labels, and that each row has a 1 in the column associated with its neighborhood.\n",
    "With these categorical features thus encoded, you can proceed as normal with fitting a Scikit-Learn model.\n",
    "\n",
    "To see the meaning of each column, you can inspect the feature names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['neighborhood=Fremont',\n",
       " 'neighborhood=Queen Anne',\n",
       " 'neighborhood=Wallingford',\n",
       " 'price',\n",
       " 'rooms']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()\n",
    "list(vec.get_feature_names_out())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is one clear disadvantage of this approach: if your category has many possible values, this can *greatly* increase the size of your dataset.\n",
    "However, because the encoded data contains mostly zeros, a sparse output can be a very efficient solution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<4x5 sparse matrix of type '<class 'numpy.int32'>'\n",
       "\twith 12 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vec = DictVectorizer(sparse=True, dtype=int)\n",
    "vec.fit_transform(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many (though not yet all) of the Scikit-Learn estimators accept such sparse inputs when fitting and evaluating models. ``sklearn.preprocessing.OneHotEncoder`` and ``sklearn.feature_extraction.FeatureHasher`` are two additional tools that Scikit-Learn includes to support this type of encoding."
   ]
  },
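  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the first of these tools, the cell below applies ``OneHotEncoder`` to the neighborhood labels from the housing example.\n",
    "The ``neighborhoods`` array and the ``encoder`` and ``encoded`` names are introduced here purely for illustration.\n",
    "The encoder is used with its default settings, which return a sparse matrix, so the result is converted with ``toarray()`` for display:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "# Illustrative input: one column holding the neighborhood label of each house\n",
    "neighborhoods = np.array([['Queen Anne'], ['Fremont'], ['Wallingford'], ['Fremont']])\n",
    "\n",
    "encoder = OneHotEncoder()  # default output is a SciPy sparse matrix\n",
    "encoded = encoder.fit_transform(neighborhoods)\n",
    "\n",
    "print(encoder.categories_)  # learned categories, in column order\n",
    "print(encoded.toarray())    # dense view: one indicator column per category"
   ]
  },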
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text Features\n",
    "\n",
    "Another common need in feature engineering is to convert text to a set of representative numerical values.\n",
    "For example, most automatic mining of social media data relies on some form of encoding the text as numbers.\n",
    "One of the simplest methods of encoding text is by *word counts*: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.\n",
    "\n",
    "For example, consider the following set of three phrases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample = ['problem of evil',\n",
    "          'evil queen',\n",
    "          'horizon problem']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a vectorization of this data based on word count, we could construct one column for each word: one for \"problem,\" one for \"evil,\" one for \"horizon,\" and so on.\n",
    "While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's ``CountVectorizer``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x5 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 7 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "vec = CountVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is a sparse matrix recording the number of times each word appears; it is easier to inspect if we convert this to a ``DataFrame`` with labeled columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>evil</th>\n",
       "      <th>horizon</th>\n",
       "      <th>of</th>\n",
       "      <th>problem</th>\n",
       "      <th>queen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   evil  horizon  of  problem  queen\n",
       "0     1        0   1        1      0\n",
       "1     1        0   0        0      1\n",
       "2     0        1   0        1      0"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()\n",
    "pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some issues with this approach, however: the raw word counts lead to features which put too much weight on words that appear very frequently, and this can be sub-optimal in some classification algorithms.\n",
    "One approach to fix this is known as *term frequency-inverse document frequency* (*TF–IDF*), which weights each word count by a measure of how often the word appears across the documents, so that words found in many documents count for less.\n",
    "The syntax for computing these features is similar to the previous example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>evil</th>\n",
       "      <th>horizon</th>\n",
       "      <th>of</th>\n",
       "      <th>problem</th>\n",
       "      <th>queen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.517856</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.680919</td>\n",
       "      <td>0.517856</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.605349</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.795961</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.795961</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.605349</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       evil   horizon        of   problem     queen\n",
       "0  0.517856  0.000000  0.680919  0.517856  0.000000\n",
       "1  0.605349  0.000000  0.000000  0.000000  0.795961\n",
       "2  0.000000  0.795961  0.000000  0.605349  0.000000"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vec = TfidfVectorizer()\n",
    "X = vec.fit_transform(sample)\n",
    "# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()\n",
    "pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())"
   ]
  },
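  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the weighting concrete, here is a sketch that recomputes the TF-IDF table from the raw word counts.\n",
    "It assumes the ``TfidfVectorizer`` defaults (``smooth_idf=True`` and ``norm='l2'``), under which the inverse document frequency of a word is ``ln((1 + n) / (1 + df)) + 1``, with ``n`` the number of documents and ``df`` the number of documents containing the word, and each row is then scaled to unit Euclidean length.\n",
    "The variable names are illustrative, and the final line only checks that the manual result agrees with the library's output:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "\n",
    "sample = ['problem of evil', 'evil queen', 'horizon problem']\n",
    "\n",
    "counts = CountVectorizer().fit_transform(sample).toarray()\n",
    "n_docs = counts.shape[0]\n",
    "doc_freq = (counts > 0).sum(axis=0)              # documents containing each word\n",
    "idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency\n",
    "weighted = counts * idf                          # rarer words get larger weights\n",
    "manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row\n",
    "\n",
    "reference = TfidfVectorizer().fit_transform(sample).toarray()\n",
    "print(np.allclose(manual, reference))"
   ]
  },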
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For an example of using TF-IDF in a classification problem, see [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Image Features\n",
    "\n",
    "Another common need is to suitably encode *images* for machine learning analysis.\n",
    "The simplest approach is what we used for the digits data in [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb): simply using the pixel values themselves.\n",
    "But depending on the application, such approaches may not be optimal.\n",
    "\n",
    "A comprehensive summary of feature extraction techniques for images is well beyond the scope of this section, but you can find excellent implementations of many of the standard approaches in the [Scikit-Image project](http://scikit-image.org).\n",
    "For one example of using Scikit-Learn and Scikit-Image together, see [Feature Engineering: Working with Images](05.14-Image-Features.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Derived Features\n",
    "\n",
    "Another useful type of feature is one that is mathematically derived from some input features.\n",
    "We saw an example of this in [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) when we constructed *polynomial features* from our input data.\n",
    "We saw that we could convert a linear regression into a polynomial regression not by changing the model, but by transforming the input!\n",
    "This is sometimes known as *basis function regression*, and is explored further in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb).\n",
    "\n",
    "For example, this data clearly cannot be well described by a straight line:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADuVJREFUeJzt3V1sZHd9xvHnqdeUIQm1xI5o1pvW6o2lNpR4O4qCVorShOIEorCiuQgS0CBV2xfUJmplVPeiFb3hwhKiL1LRNqFNS8JLg2OFiMSkCghxwaLZeMEJG1cpCiJ22p1QOS9lBBvz64XHYXewd85k58yZ3+b7kUZ7fM5/5zz6Z+fx8Zn/xI4IAQDy+IWqAwAA+kNxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJLOvjCfdv39/TE1NlfHUAHBROnHixPMRUS8ytpTinpqaUrPZLOOpAeCiZPv7RcdyqwQAkqG4ASAZihsAkqG4ASAZihsAkulZ3LanbZ886/Gi7TuHEQ4A8PN6LgeMiDVJV0mS7TFJ65IeKDkXAKSwtLKuheU1bWy2dWCiprnZaR2ZmSz1nP2u475B0n9FROH1hgBwsVpaWdf84qraZ7YkSeubbc0vrkpSqeXd7z3u2yR9towgAJDNwvLaq6W9o31mSwvLa6Wet3Bx236DpFsk/fsex4/abtputlqtQeUDgJG1sdnua/+g9HPFfZOkxyPif3Y7GBHHIqIREY16vdDH7QEgtQMTtb72D0o/xf1+cZsEAF41Nzut2vjYOftq42Oam50u9byF3py0/SZJvyPpD0pNAwCJ7LwBOZKrSiLiR5LeUmoSAEjoyMxk6UXdjU9OAkAyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJFOouG1P2L7f9lO2T9l+R9nBAAC721dw3N9KeiQibrX9BklvKjETAOA8eha37TdLulbS7ZIUET+R9JNyYwEA9lLkVsmvSWpJ+mfbK7bvsn1J9yDbR203bTdbrdbAgwIAthUp7n2SDkn6x4iYkfR/kv6ie1BEHIuIRkQ06vX6gGMCAHYUKe5nJT0bEcc7X9+v7SIHAFSgZ3FHxH9L+oHt6c6uGyR9t9RUAIA9FV1V8ieS7u2sKPmepA+XFwkAcD6FijsiTkpqlJwFAFAAn5wEgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIhuIGgGQobgBIZl+RQbafkfSSpC1Jr0REo8xQAIC9FSrujt+OiOdLSwIAKIRbJQCQTNHiDklfsX3C9tHdBtg+artpu9lqtQaXEABwjqLFfTgiDkm6SdJHbF/bPSAijkVEIyIa9Xp9oCEBAD9TqLgjYqPz52lJD0i6usxQAIC99Sxu25fYvmxnW9K7JD1RdjAAwO6KrCp5q6QHbO+Mvy8iHik1FQBgTz2LOyK+J+ntQ8gCACiA5YAAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkAzFDQDJUNwAkEzh4rY9ZnvF9kNlBgIAnN++PsbeIemUpDeXEWRpZV0Ly2va2GzrwERNc7PTOjIzWcapACC1Qlfctg9Keo+ku8oIsbSyrvnFVa1vthWS1jfbml9c1dLKehmnA4DUit4q+aSkj0r6aRkhFpbX1D6zdc6+9pktLSyvlXE6AEitZ3HbvlnS
6Yg40WPcUdtN281Wq9VXiI3Ndl/7AeD1rMgV92FJt9h+RtLnJF1v+zPdgyLiWEQ0IqJRr9f7CnFgotbXfgB4PetZ3BExHxEHI2JK0m2SHouIDwwyxNzstGrjY+fsq42PaW52epCnAYCLQj+rSkqzs3qEVSUA0JsjYuBP2mg0otlsDvx5AeBiZftERDSKjOWTkwCQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMn0LG7bb7T9Ldvftv2k7Y8NIxgAYHf7Coz5saTrI+Jl2+OSvmH74Yj4ZsnZAFRkaWVdC8tr2ths68BETXOz0zoyM1l1LHT0LO6ICEkvd74c7zyizFAAqrO0sq75xVW1z2xJktY325pfXJUkyntEFLrHbXvM9klJpyU9GhHHy40FoCoLy2uvlvaO9pktLSyvVZQI3QoVd0RsRcRVkg5Kutr2ld1jbB+13bTdbLVag84JYEg2Ntt97cfw9bWqJCI2JX1N0o27HDsWEY2IaNTr9QHFAzBsByZqfe3H8BVZVVK3PdHZrkl6p6Snyg4GoBpzs9OqjY+ds682Pqa52emKEqFbkVUll0u6x/aYtov+CxHxULmxAFRl5w1IVpWMriKrSr4jaWYIWQCMiCMzkxT1COOTkwCQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMlQ3ACQDMUNAMn0LG7bV9j+qu1Ttp+0fccwggEAdrevwJhXJP15RDxu+zJJJ2w/GhHfLTkbzmNpZV0Ly2va2GzrwERNc7PTOjIzWXUsAEPQs7gj4jlJz3W2X7J9StKkJIq7Iksr65pfXFX7zJYkaX2zrfnFVUmivIHXgb7ucduekjQj6XgZYVDMwvLaq6W9o31mSwvLaxUlAjBMhYvb9qWSvijpzoh4cZfjR203bTdbrdYgM6LLxma7r/0ALi6Fitv2uLZL+96IWNxtTEQci4hGRDTq9fogM6LLgYlaX/sBXFyKrCqxpLslnYqIT5QfCb3MzU6rNj52zr7a+JjmZqcrSgRgmIpccR+W9EFJ19s+2Xm8u+RcOI8jM5P6+PvepsmJmixpcqKmj7/vbbwxCbxOFFlV8g1JHkIW9OHIzCRFDbxO8clJAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZChuAEiG4gaAZPb1GmD705JulnQ6Iq4sPxIweEsr61pYXtPGZlsHJmqam53WkZnJqmMBr0mRK+5/kXRjyTmA0iytrGt+cVXrm22FpPXNtuYXV7W0sl51NOA16VncEfF1Sf87hCxAKRaW19Q+s3XOvvaZLS0sr1WUCLgwA7vHbfuo7abtZqvVGtTTAhdsY7Pd135g1A2suCPiWEQ0IqJRr9cH9bTABTswUetrPzDqWFWCi97c7LRq42Pn7KuNj2ludrqiRMCF6bmqBMhuZ/UIq0pwsSiyHPCzkq6TtN/2s5L+OiLuLjsYMEhHZiYpalw0ehZ3RLx/GEEAAMVwjxsAkqG4ASAZihsAkqG4ASAZihsAknFEDP5J7Zak77/Gv75f0vMDjDMo5OoPufpDrv6MYq4LzfSrEVHoY+elFPeFsN2MiEbVObqRqz/k6g+5+jOKuYaZiVslAJAMxQ0AyYxicR+rOsAeyNUfcvWHXP0ZxVxDyzRy97gBAOc3ilfcAIDzqKy4bX/a9mnbT+xx3Lb/zvbTtr9j+9AIZLrO9gu2T3Yef1V2ps55r7D9
VdunbD9p+45dxlQxX0VyDX3ObL/R9rdsf7uT62O7jPlF25/vzNdx21Mjkut2262z5uv3y87VOe+Y7RXbD+1ybOhzVTBXVXP1jO3Vzjmbuxwv/7UYEZU8JF0r6ZCkJ/Y4/m5JD0uypGskHR+BTNdJeqiCubpc0qHO9mWS/lPSr4/AfBXJNfQ568zBpZ3tcUnHJV3TNeaPJX2qs32bpM+PSK7bJf1DBf/G/kzSfbv9t6pirgrmqmqunpG0/zzHS38tVnbFHb1/CfF7Jf1rbPumpAnbl1ecqRIR8VxEPN7ZfknSKUnd/3PpKuarSK6h68zBy50vxzuP7jdz3ivpns72/ZJusO0RyDV0tg9Keo+ku/YYMvS5KphrVJX+Whzle9yTkn5w1tfPagRKQdI7Oj/qPmz7N4Z98s6PqTPavlo7W6XzdZ5cUgVz1vkR+6Sk05IejYg95ysiXpH0gqS3jEAuSfrdzo/Y99u+ouxMkj4p6aOSfrrH8UrmqkAuafhzJW1/s/2K7RO2j+5yvPTX4igX927f0au+Onlc2x9Lfbukv5e0NMyT275U0hcl3RkRL3Yf3uWvDGW+euSqZM4iYisirpJ0UNLVtq/sGlLJfBXI9SVJUxHxm5L+Qz+70i2F7ZslnY6IE+cbtsu+UueqYK6hztVZDkfEIUk3SfqI7Wu7jpc+X6Nc3M9KOvs76EFJGxVlkSRFxIs7P+pGxJcljdveP4xz2x7XdjneGxGLuwypZL565apyzjrn3JT0NUk3dh16db5s75P0SxribbK9ckXEDyPix50v/0nSb5Uc5bCkW2w/I+lzkq63/ZmuMVXMVc9cFczVznk3On+elvSApKu7hpT+Whzl4n5Q0oc679BeI+mFiHiuykC2f3nn3p7tq7U9fz8cwnkt6W5JpyLiE3sMG/p8FclVxZzZrtue6GzXJL1T0lNdwx6U9Hud7VslPRadd5aqzNV1L/QWbb9vUJqImI+IgxExpe03Hh+LiA90DRv6XBXJNey56pzzEtuX7WxLepek7lVopb8WK/st797llxBr+80aRcSnJH1Z2+/OPi3pR5I+PAKZbpX0R7ZfkdSWdFvZ/4A7Dkv6oKTVzv1RSfpLSb9yVrahz1fBXFXM2eWS7rE9pu1vFF+IiIds/42kZkQ8qO1vOP9m+2ltXz3eVnKmorn+1PYtkl7p5Lp9CLl+zgjMVZFcVczVWyU90LkW2Sfpvoh4xPYfSsN7LfLJSQBIZpRvlQAAdkFxA0AyFDcAJENxA0AyFDcAJENxA0AyFDcAJENxA0Ay/w9oplBLS/6cOgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "x = np.array([1, 2, 3, 4, 5])\n",
    "y = np.array([4, 2, 1, 3, 7])\n",
    "plt.scatter(x, y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Still, we can fit a line to the data using ``LinearRegression`` and get the best straight-line fit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFVxJREFUeJzt3W1wXOV5xvHrRhYgbBxhW1qQsBAGI1Y4DaICQogdYqSIBIa4aT6QhrRkQtw0aRKaVkzdSZtpZzqdjmcySV+mGTdJmzQvTUqMJ2USlEhACJmGRMY0JpbFW82LRCTZIBuDsCX57ofdNbLQas/aOrvnWf1/Mxqvdo+1Nw/eS0dnz6Vj7i4AQDhOK/cAAIDiENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMAQ3AASG4AaAwCyJ44uuWrXKm5ub4/jSAFCRdu7cud/d66JsG0twNzc3q7+/P44vDQAVycyeiboth0oAIDAENwAEhuAGgMAQ3AAQGIIbAAJTMLjNrMXMHp3xccjM7ijFcACANyp4OqC7D0q6XJLMrErSkKS7Y54LAIKwY9eQtvYManh8Qg21NeruatGmtsZYn7PY87ivl/SUu0c+3xAAKtWOXUPasn23JianJUlD4xPasn23JMUa3sUe475F0rfjGAQAQrO1Z/B4aOdMTE5ra89grM8bObjN7HRJN0v6rzyPbzazfjPrHxsbW6j5ACCxhscnirp/oRSzx/1uSY+4+8hcD7r7Nndvd/f2urpIdXsACFpDbU1R9y+UYoL7A+IwCQAc193VoprqqhPuq6muUndXS6zPG+nNSTM7S1KnpD+MdRoACEjuDchEnlXi7q9KWhnrJAAQoE1tjbEH9Ww0JwEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDAENwAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIGJFNxmVmtmd5nZXjMbMLNr4h4MADC3JRG3+6Kke939/WZ2uqSzYpwJADCPgsFtZsslbZB0myS5+1FJR+MdCwCQT5RDJWskjUn6NzPbZWZfNrOlszcys81m1m9m/WNjYws+KAAgI0pwL5F0haR/cfc2Sa9I+vPZG7n7Nndvd/f2urq6BR4TAJATJbifl/S8uz+c/fwuZYIcAFAGBYPb3X8j6Tkza8nedb2kPbFOBQDIK+pZJZ+U9M3sGSVPS/pwfCMBAOYTKbjd/VFJ7THPAgCIgOYkAASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDAENwAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMEuibGRm+yS9LGla0pS7t8c5FAAgv0jBnfVOd98f2yQAgEg4VAIAgYka3C7pR2a208w2z7WBmW02s34z6x8bG1u4CQEAJ4ga3Ne6+xWS3i3pE2a2YfYG7r7N3dvdvb2urm5BhwQAvC5ScLv7cPbPUUl3S7oqzqEAAPkVDG4zW2pmZ+duS3qXpMfiHgwAMLcoZ5WkJN1tZrntv+Xu98Y6FQAgr4LB7e5PS3pLCWYBgGBNTR9T1Wmm7E5urIo5jxsAMMPBiUn95PEx9Q2M6P69o9r+8Wt1cf2y2J+X4AaAIjx74FX1Doyod2BEv/i/FzV1zLVy6el612XnqgQ725IIbgCY1/Qx16PPjasvG9aPjxyWJK2tX6aPblijjnS9Ll99jqpOK1Fqi+AGgDd49eiUfvrEfvXuGdH9g6Paf/ioqk4zXdW8Qn95U5M60vW6YOXSss1HcAOApN8cfE19e0fUu2dEP3vqgI5OHdPZZy7RO1vqdX26XtddUq83nVVd7jElEdwA
Fil316+HD6lvYFS9AyPaPXRQktS04izdevUF6mit15XNK1Rdlbxf6URwA1g0jkxN63+eOqDegRH1DYzqhYOvyUy6oukc3XlDizrTKV1cv6wkp/SdCoIbQEU7cPiI7ts7qr6BUT34xJhePTqtmuoqbbhklf6k8xJtvLReq5adUe4xi0JwA6go7q6nxg7rx3tG1Tcwop3PviR36dzlZ+p32hrV0ZrSNWtW6szqqnKPetIIbgDBm5w+pl/ue/H48epnDrwqSVrXuFyf2rhWna0pXdawPPGHQKIiuAEEaXZr8dBrUzq96jS97eKVun39Gl1/ab0aamvKPWYsCG4AwZivtdiRTmn92lVaekblx1rl/xcCCNZ8rcXb169RZ2vpW4tJQHADSJSktxaTgOAGUHb5WovXtdSrI2GtxSQguAGUXK61mCvC5FqLq1fUZFqL6XpdeWEyW4tJQHADKIl8rcW21bW684YWdaRTWhtAazEJCG4AscnXWly/NtzWYhIQ3AAWTL7WYmr5GZnWYjqlay4Ku7WYBAQ3gFOSr7V4WUOmtdiRTmldY+W0FpOA4AZQtFxrsXfPiB4YXFytxSQguAFEMldrccXx1mK91q+tWxStxSRglQHMKV9r8eJF3lpMgsjBbWZVkvolDbn7TfGNBKBc5mstfvbG1epIp9S8anG3FpOgmD3uT0sakLQ8jkF27BrS1p5BDY9PqKG2Rt1dLdrU1hjHUwGYIV9r8R2X1KmzNUVrMYEiBbeZnS/pRkl/K+kzCz3Ejl1D2rJ9tyYmpyVJQ+MT2rJ9tyQR3sACm6+1+MGrm9SZTtFaTLioe9xfkHSnpLPjGGJrz+Dx0M6ZmJzW1p5BghtYAPO1Fru7WtTZSmsxJAWD28xukjTq7jvN7Lp5ttssabMkNTU1FTXE8PhEUfcDKIzWYuWKssd9raSbzew9ks6UtNzMvuHut87cyN23SdomSe3t7V7MEA21NRqaI6Q5DxSIbr7W4qa2RnXSWqwYBYPb3bdI2iJJ2T3uP5sd2qequ6vlhGPcklRTXaXurpaFfBqg4kxOH1P/vpeyh0BGtI/W4qKQiPO4c8exOasEKGy+ay1+hNbiomDuRR3ViKS9vd37+/sX/OsCi1W+1uLGS+tpLVYIM9vp7u1RtuX/NJBAx465Hn1+XL17uNYi3ojgBhIi11rsGxjRfXu51iLyI7iBMuJaizgZBDdQQrnWYu53V9NaxMkguIGY5VqLfQOZ86uHudYiThHBDcTgwOEjun8wc6GBnz4xpldmtBbvoLWIU0RwAwugUGuRay1iIRHcwEmamj6mX87RWlzXmGktdramdFkDrUUsPIIbKMKh1yb1k8Ex9Q6M6IHBMR2cmKS1iJIjuIECcq3Fvr0jevjpTGtx5dLT1dmaUkc6pfVrV9FaREnxrw2YZWZrsW9gVIMjL0vKtBY/umGNOtK0FlFeBDegTGvxoSf2q3fO1mIrrUUkCsGNRSvXWuwbGNVDT+6ntYhgENxYNPK1FptWnKVbr75AHel6WosIAsGNilaotdiZTuliWosIDMGNipNrLfYNjOjBx19vLW64hNYiKgPBjeDlWou9A6Pq3TOiR559ScdcOnf5mZnWYmtK16yhtYjKQXAjSLnWYl/2qjAzW4ufpLWICkdwIxi0FoEMghuJ9tyLr19rkdYikMG/eCTKfK1FrrUIZBDcKLsTW4tj2n/4CK1FYB4EN8pi5NBr2V+HOqqfPblfR2gtApER3CgJd9eeFw6pd8+o+vaO6FfPv95a/CCtRaAoBYPbzM6U9KCkM7Lb3+Xun4t7MITvyNS0fv70i9nj1bQWgYUSZY/7iKSN7n7YzKolPWRmP3T3n8c8GwL04itHdd/e0Te0FrnWYlh27BrS1p5BDY9PqKG2Rt1dLdrU1ljusZBVMLjd3SUdzn5anf3wOIdCODKtxVcyp+zNaC1yrcVw7dg1pC3bd2ticlqSNDQ+oS3bd0sS4Z0QkY5xm1mVpJ2SLpb0
z+7+cKxTIdGmpo+p/5mX1LvnxNbiZQ20FivB1p7B46GdMzE5ra09gwR3QkQKbneflnS5mdVKutvM1rn7YzO3MbPNkjZLUlNT04IPivLKtRb7BkZ0P63FijY8PlHU/Si9os4qcfdxM3tA0g2SHpv12DZJ2ySpvb2dQykVYK7W4orjrcV6rV9bR2uxAjXU1mhojpDmG3NyRDmrpE7SZDa0ayR1SPr72CdDyeVai30DI+rdQ2txseruajnhGLck1VRXqburpYxTYaYou0vnSfpa9jj3aZK+6+73xDsWSmW+1uJnb0yrI51S8ypai4tJ7jg2Z5UkV5SzSn4lqa0Es6BERg69dvzyXbQWMZdNbY0EdYJxgHIRyNdaXL2iRr93dZM60ylai0BACO4KVai12JFOaS2tRSBIBHcFefGVo7p/b+YQCK1FoHIR3AGb2VrsGxjRzmdeby2+t61RnbQWgYpEcAemUGuxI53SukZai0AlI7gDkK+1eM1FK/WRt1+o69MpyhHAIkJwJ1Sh1uLb19ZpGa1FYFHilZ8Q+VqLF9NaBDALwV1GudZi38Co+vaO0loEEAnBXWK0FgGcKoI7ZrQWASw0gjsG87UWu7ta1NlKaxHAySO4F8i811rsuETvvLRedWfTWgRw6gjuk0RrEUC5ENxFmK+1+Mcb16qT1iKAEiC4C5i3tci1FgGUAcE9h3ytxY50Sp2ttBYBlBfpo8KtxY50vdqaaC0CSIZFG9z5rrV4ZfM5tBYBJNqiCu6RQ69lzwIZfb21eMYSvaOlTp2tKVqLAIJQ0cFdqLXYkU7pyuYVOn0JrUUA4ai44M7XWrw821rsSKd0SYrWIoBwVURw01oEsJgEGdwzW4u9e0b0yLMnthY70vV620WraC0CqEgFg9vMVkv6uqRzJR2TtM3dvxj3YLNNTR/TL/e9lDllb0ZrsfW8TGuxI12vdQ1v0mmcsgegwkXZ456S9Kfu/oiZnS1pp5n92N33xDzb8dZi78CIHpjjWosb0yk1LtLW4o5dQ9raM6jh8Qk11Naou6tFm9oayz0WgBIoGNzu/oKkF7K3XzazAUmNkmIJ7rlai+ecVU1rcYYdu4a0ZftuTUxOS5KGxie0ZftuSSK8gUWgqAQ0s2ZJbZIejmOYex/7jT72jZ2SpIvqluoj6y9UZzpFa3GWrT2Dx0M7Z2JyWlt7BgluYBGIHNxmtkzS9yTd4e6H5nh8s6TNktTU1HRSw1x9YeZai9enU7qQ1mJew+MTRd0PoLJEap6YWbUyof1Nd98+1zbuvs3d2929va6u7qSGOWfp6bp9/RpCu4B8v42Q31IILA4Fg9syTZWvSBpw98/HPxIK6e5qUc2sUx1rqqvU3dVSpokAlFKUPe5rJX1I0kYzezT78Z6Y58I8NrU16u/e92Y11tbIJDXW1ujv3vdmjm8Di0SUs0oeksQ7gwmzqa2RoAYWKX67EgAEhuAGgMAQ3AAQGIIbAAJDcANAYAhuAAgMwQ0AgSG4ASAwBDcABIbgBoDAENwAEBiCGwACQ3ADQGAIbgAIDMENAIEhuAEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACAzBDQCBIbgBIDBLCm1gZl+VdJOkUXdfF/9IwMLbsWtIW3sGNTw+oYbaGnV3tWhTW2O5xwJOSpQ97n+XdEPMcwCx2bFrSFu279bQ+IRc0tD4hLZs360du4bKPRpwUgoGt7s/KOnFEswCxGJrz6AmJqdPuG9iclpbewbLNBFwahbsGLeZbTazfjPrHxsbW6gvC5yy4fGJou4Hkm7Bgtvdt7l7u7u319XVLdSXBU5ZQ21NUfcDScdZJah43V0tqqmuOuG+muoqdXe1lGki4NQUPKsECF3u7BHOKkGliHI64LclXSdplZk9L+lz7v6VuAcDFtKmtkaCGhWjYHC7+wdKMQgAIBqOcQNAYAhuAAgMwQ0AgSG4ASAwBDcABMbcfeG/qNmYpGdO8q+vkrR/AcdZKMxV
HOYqDnMVJ4lznepMF7h7pNp5LMF9Ksys393byz3HbMxVHOYqDnMVJ4lzlXImDpUAQGAIbgAITBKDe1u5B8iDuYrDXMVhruIkca6SzZS4Y9wAgPklcY8bADCPsgW3mX3VzEbN7LE8j5uZ/YOZPWlmvzKzKxIw03VmdtDMHs1+/FXcM2Wfd7WZ3W9mA2b2azP79BzblGO9osxV8jUzszPN7Bdm9r/Zuf56jm3OMLPvZNfrYTNrTshct5nZ2Iz1uj3uubLPW2Vmu8zsnjkeK/laRZyrXGu1z8x2Z5+zf47H438tuntZPiRtkHSFpMfyPP4eST+UZJLeKunhBMx0naR7yrBW50m6Inv7bEmPS2pNwHpFmavka5Zdg2XZ29WSHpb01lnbfFzSl7K3b5H0nYTMdZukfyrDv7HPSPrWXP+vyrFWEecq11rtk7Rqnsdjfy2WbY/bC1+E+L2Svu4ZP5dUa2bnlXmmsnD3F9z9keztlyUNSJr9y6XLsV5R5iq57Boczn5anf2Y/WbOeyV9LXv7LknXm5klYK6SM7PzJd0o6ct5Nin5WkWcK6lify0m+Rh3o6TnZnz+vBIQCpKuyf6o+0Mzu6zUT579MbVNmb21mcq6XvPMJZVhzbI/Yj8qaVTSj90973q5+5Skg5JWJmAuSfrd7I/Yd5nZ6rhnkvQFSXdKOpbn8bKsVYS5pNKvlZT5ZvsjM9tpZpvneDz212KSg3uu7+jl3jt5RJla6lsk/aOkHaV8cjNbJul7ku5w90OzH57jr5RkvQrMVZY1c/dpd79c0vmSrjKzdbM2Kct6RZjrvyU1u/tvSerV63u6sTCzmySNuvvO+Tab475Y1yriXCVdqxmudfcrJL1b0ifMbMOsx2NfryQH9/OSZn4HPV/ScJlmkSS5+6Hcj7ru/gNJ1Wa2qhTPbWbVyoTjN919+xyblGW9Cs1VzjXLPue4pAck3TDroePrZWZLJL1JJTxMlm8udz/g7keyn/6rpN+OeZRrJd1sZvsk/aekjWb2jVnblGOtCs5VhrXKPe9w9s9RSXdLumrWJrG/FpMc3N+X9PvZd2jfKumgu79QzoHM7NzcsT0zu0qZ9TtQguc1SV+RNODun8+zWcnXK8pc5VgzM6szs9rs7RpJHZL2ztrs+5L+IHv7/ZLu8+w7S+Wca9ax0JuVed8gNu6+xd3Pd/dmZd54vM/db521WcnXKspcpV6r7HMuNbOzc7clvUvS7LPQYn8tlu0q7zbHRYiVebNG7v4lST9Q5t3ZJyW9KunDCZjp/ZL+yMymJE1IuiXuf8BZ10r6kKTd2eOjkvQXkppmzFby9Yo4VznW7DxJXzOzKmW+UXzX3e8xs7+R1O/u31fmG85/mNmTyuw93hLzTFHn+pSZ3SxpKjvXbSWY6w0SsFZR5irHWqUk3Z3dF1ki6Vvufq+ZfUwq3WuR5iQABCbJh0oAAHMguAEgMAQ3AASG4AaAwBDcABAYghsAAkNwA0BgCG4ACMz/A13F6BU5ePHBAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "X = x[:, np.newaxis]\n",
    "model = LinearRegression().fit(X, y)\n",
    "yfit = model.predict(X)\n",
    "plt.scatter(x, y)\n",
    "plt.plot(x, yfit);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's clear that we need a more sophisticated model to describe the relationship between $x$ and $y$.\n",
    "\n",
    "One approach is to transform the data, adding extra columns of features to give the model more flexibility.\n",
    "For example, we can add polynomial features to the data this way:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[  1.   1.   1.]\n",
      " [  2.   4.   8.]\n",
      " [  3.   9.  27.]\n",
      " [  4.  16.  64.]\n",
      " [  5.  25. 125.]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "poly = PolynomialFeatures(degree=3, include_bias=False)\n",
    "X2 = poly.fit_transform(X)\n",
    "print(X2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The derived feature matrix has one column representing $x$, a second column representing $x^2$, and a third column representing $x^3$.\n",
    "Computing a linear regression on this expanded input gives a much closer fit to our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAW4AAAD8CAYAAABXe05zAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAIABJREFUeJzt3Xd8VFXCxvHfSYMEEkJIKCGEUEOvodnFAoKr2NG1F9DdtawuKG59d99XVFZX3V1XELuyKojs6oKAXZRiqKEk9JZQQiAJIX1y3j8SWMRAZiAzd2byfD+ffBhmLjMPB/LMzZl77zHWWkREJHCEOB1AREQ8o+IWEQkwKm4RkQCj4hYRCTAqbhGRAKPiFhEJMCpuEZEAo+IWEQkwKm4RkQAT5o0njY+PtykpKd54ahGRoLR8+fID1toEd7b1SnGnpKSQnp7ujacWEQlKxpgd7m6rqRIRkQCj4hYRCTAqbhGRAKPiFhEJMCpuEZEAo+IWEQkwdRa3MSbVGLPquK9CY8xDvggnIiI/Vudx3NbaLKAfgDEmFMgGPvRyLhGRgDBnZTZPfZLJnoJSWsc05rHLujGmf1uvvqanJ+BcBGyx1rp9oLiISLCaszKbSbMzKKlwAbC3sJRJszMAvFrens5xjwX+6Y0gIiKBZsr8rGOlfVRJhYsp87O8+rpuF7cxJgK4Aph5ksfHGWPSjTHpubm59ZVPRMRv5eSXeHR/ffFkj/syYIW1dl9tD1prp1lr06y1aQkJbl0nRUQkoCVEN6r1/sTYSK++rifFfSOaJhEROaZ5VMSP7osMD2XCiFSvvq5bxW2MiQIuAWZ7NY2ISID4amMuWfsOM6ZfIm1jIzFA29hIJl/d2z+OKrHWFgMtvJpERCRAuKosk+duIDkuiqev7UtEmG/PZdSZkyIiHvpgxW4y9x5m4shUn5c2qLhFRDxSUu7imQVZ9GsXy+jebRzJoOIWEfHAK4u2sq+wjF+P7o4xxpEMKm4RETcdKCrjpa+2cmmPVgxKiXMsh4pbRMRNz3+6iZIKF49e1s3RHCpuERE3bMktYsayndw0OJlOCU0dzaLiFhFxw1PzMokMD+XBi7s4HUXFLSJSl++3H2TB+n3ce35H4pvWfpq7L6m4RUROwVrLE3M30DqmMXed09HpOICKW0TklOZm7GXlznwevrQrkRGhTscBVNwiIidVXlnF0/Mz6dY6mmsGJDkd5xgVt4jISby9ZAc78oqZNKo7oSHOnGxTGxW3iEgtCkoqeOHzTZzbJZ7zu/rXGgMqbhGRWrz45WYKSip4zOGTbWqj4hYROcHuQ8W89u12rurflp6JzZyO8yMqbhGREzyzYCMG+NWl3l3J5nSpuEVEjrM2u4APV2Zz5zkdvL525OlScYuI1Dh6sk1ckwjuu6CT03FOSsUtIlLjy6xcvtuSxwPDOxPTONzpOCel4hYRASpdVUyet4GUFlHcNKS903FOScUtIgLMWr6bjfuKeHRkN0fWkfSEf6cTEfGB4vJKnl24kYHtmzOyV2un49RJxS0iDd7LX29j/+EyHh/VzbF1JD2h4haRBm3/4VKmfr2Fy3q1ZmB759aR9IRbxW2MiTXGzDLGZBpjNhhjhnk7mIiILzz36SbKK6uYONL/Tm0/mTA3t3se+MRae60xJgKI8mImERGf2Lz/MO99v4tbhranQ3wTp+O4rc7iNsbEAOcBtwNYa8uBcu/GEhHxvifnZRIVHsoDFzm/jqQn3Jkq6QjkAq8ZY1YaY6YbY3701mSMGWeMSTfGpOfm5tZ7UBGR+rRkax6fbtjPfRd2Iq5JhNNxPOJOcYcBA4B/WGv7A0eAx07cyFo7zVqbZq1NS0jwr2vXiogcr6qq+tT2xGaNufPsDk7H8Zg7xb0b2G2tXVrz+1lUF7mISED6aE0Oa3YX8MilqTQO9491JD1RZ3Fba/cCu4wxR69veBGw3qupRES8
pKzSxZT5WfRoE8NV/ds6Hee0uHtUyf3AOzVHlGwF7vBeJBER73nzux3sPlTC23f1IcSP1pH0hFvFba1dBaR5OYuIiFflF5fz1883cX7XBM7pEu90nNOmMydFpMH42+ebKSqrZNKowDnZpjYqbhFpEHYdLObNxTu4dmAS3VrHOB3njKi4RaRBeHp+FiEh8PAl/rmOpCdU3CIS9Fbvyuej1Tncc25HWjdr7HScM6biFpGgZq3l/+ZuIL5pBOPP9991JD2h4haRoPbphv0s23aQBy/uStNG7h4B7d9U3CIStCpdVTw5bwMdE5owdlA7p+PUGxW3iAStd7/fxZbcIzw2shvhocFTd8HzNxEROU5RWSXPfbqRwSlxXNKjldNx6lVwTPiIiJxg2ldbOFBUzvTbugfEOpKe0B63iASdfYWlvPzNNi7v04Z+7WKdjlPvVNwiEnSeXbCRyqoqJo4I7FPbT0bFLSJBJWvvYWYu38Wtw1JIbhGcy+OquEUkqEyet4GmjcK4f3hnp6N4jYpbRILGt5sP8GVWLr8Y3pnYqMBaR9ITKm4RCQpH15FsGxvJrcNSnI7jVSpuEQkKc1Zlsy6nkIkjA3MdSU+ouEUk4JVWuPjz/Cx6t23GT/okOh3H61TcIhLwXvt2OzkFpTw+qnvAriPpCRW3iAS0g0fKefGLzVzUrSXDOrVwOo5PqLhFJKC98NkmjpRX8thlwXmyTW1U3CISsLYfOMLbS3Zww6BkurSKdjqOz6i4RSRgPT0/k4iwEH55SReno/iUW1cHNMZsBw4DLqDSWpvmzVAiInVZvuMQczP28tDFXWgZHfjrSHrCk8u6XmitPeC1JCIibrK2+mSbhOhG3HNuR6fj+JymSkQk4Mxft5flOw7x8CVdaRIk60h6wt3itsACY8xyY8w4bwYSETmVClcVT32SRZeWTbluYJLTcRzh7lvV2dbaHGNMS2ChMSbTWvv18RvUFPo4gOTk5HqOKSJSbcbSnWw7cIRXb08jLIjWkfSEW39ra21Oza/7gQ+BwbVsM81am2atTUtISKjflCIiQGFpBc9/tolhHVtwYWpLp+M4ps7iNsY0McZEH70NXAqs9XYwEZETvfTlFg4eKefxUcG3jqQn3JkqaQV8WDNIYcAMa+0nXk0lInKCnPwSXlm0jTH9Eumd1MzpOI6qs7ittVuBvj7IIiJyUs8s2IgFfjUi1ekojmuYM/siElDW5xQye+Vu7jgrhaTmwbmOpCdU3CLi9ybP20CzyHB+dmHwriPpCRW3iPi1rzbm8s2mA9w/vAvNIsOdjuMXVNwi4rdcVZbJczeQHBfFLUPbOx3Hb6i4RcRvfbBiN5l7DzNxZCoRYaqrozQSIuKXSspdPLMgi37tYhndu43TcfyKiltE/NIri7ayr7CMX49u2Cfb1EbFLSJ+50BRGS99tZVLe7RiUEqc03H8jopbRPzO859uoqTCxaMNaB1JT6i4RcSvbMktYsayndw0OJlOCU2djuOXVNwi4leempdJZHgoD17csNaR9ISKW0T8xrJtB1mwfh/3nt+R+KaNnI7jt1TcIuIXjq4j2TqmMXed0/DWkfSEiltE/MJ/Mvawalc+D1/alciIUKfj+DUVt4g4rqzSxdOfZNGtdTTXDGiY60h6QsUtIo57e8lOdh4sZtKo7oSG6GSbuqi4RcRRBSUV/PXzTZzbJZ7zu2q9WneouEXEUS9+sZmCkgomXdbd6SgBQ8UtIo7ZfaiY177bztX9k+iRGON0nICh4hYRx/x5fhYG+NWIrk5HCSgqbhFxRMbuAuasyuGuczrQplmk03ECiopbRHzu6Mk2cU0iuPeCTk7HCTgqbhHxuS+y9rN4ax4PXtSFmMZaR9JTKm4R8alKVxWT52bSIb4JNw1JdjpOQHK7uI0xocaYlcaYj70ZSESC28zlu9m0v4hHR6YSHqp9x9MR5sG2DwIbAK8cszNnZTZPzstkb2EprWMa89hl3RjTv603XkpEHHKk
rJJnF24krX1zRvRs7XScgOXW250xJgkYDUz3Rog5K7OZNDuDvYWlAOwtLOWxD9YwZ2W2N15ORBzy8jdbyT1cxqRRWkfyTLj7c8pzwESgyhshpszPoqTC9YP7SiurmDxvgzdeTkQcsP9wKdO+3sqo3q0Z2L6503ECWp3FbYy5HNhvrV1ex3bjjDHpxpj03Nxcj0Lk5JfUev++wjK2Hzji0XOJiH/6y8JNVLiqmDhC60ieKXf2uM8GrjDGbAfeBYYbY94+cSNr7TRrbZq1Ni0hwbMLxSTG1n7wfYiB66YuZuO+wx49n4j4l037DvPe9zv56ZD2pMQ3cTpOwKuzuK21k6y1SdbaFGAs8Lm19ub6DDFhRCqR4T+8cHpkeCgTR3TDADdMXcza7IL6fEkR8aEn52XSJCKMBy7SOpL1wS+OxRnTvy2Tr+5N29hIDNA2NpLJV/fm3gs68f74YURFhHHjy0tYvuOg01FFxEOLt+TxWeZ+fnZhZ+KaRDgdJygYa229P2laWppNT0+vt+fLzi/h5ulL2VdYyvRb0zirc3y9PbeIeE9VleXKv39LXlEZn//qAhqHa0mykzHGLLfWprmzrV/scdelbWwk740fSrvmUdz++vd8nrnP6Ugi4oaP1uSQkV3Ar0akqrTrUUAUN0DL6Ma8O24oqa2iGf/WcuZm7HE6koicQmlF9TqSPRNjGNNPJ9PVp4ApboDmTSJ4554h9E2K5RczVvDB8t1ORxKRk3hz8Xay80t4fFR3QrSOZL0KqOIGiGkczpt3DeasTvE8MnM1by3Z4XQkETlBfnE5f/t8MxekJnC2PpOqdwFX3ABREWFMvy2Ni7q15Ldz1jLt6y1ORxKR4/z1880UlVVqHUkvCcjiBmgcHspLtwxkdJ82PDE3k78s3Ig3jpAREc/szCvmzcXbuW5gO1JbRzsdJyh5cnVAvxMeGsILY/sTGR7K859tori8ksd18RoRRz09P5OwkBAevlTrSHpLQBc3QGiI4elr+tAkIpSXv9lGcbmLP13ZSx+GiDhg1a58Pl6zhweGd6ZVTGOn4wStgC9ugJAQwx+u6ElkRBgvfbWFknIXT1/bhzBdpF3EZ6y1PPGfDcQ3jWDc+VpH0puCorgBjDE8OjKVJhGhPLNwIyUVLp4f25+IMJW3iC8sXL+PZdsP8r9jetG0UdBUi18KqlYzxnD/RV34zejuzFu7l/FvpVN6wnW+RaT+VbiqePKTTDolNGHsoHZOxwl6QVXcR919bkeeuKo3X27M5Y7XvudIWaXTkUSC2rvf72Jr7hEeu6y7pih9IGhH+KYhyTx7fV+WbT/ILa8spaCkwulIIkGpqKyS5z/dyOAOcVzcvaXTcRqEoC1ugKv6J/H3mwaQkV3AjdOWkFdU5nQkkaAz9astHCgq59c6FNdngrq4AUb2as3Lt6axJbeIsdOWsK9mQWIROXN7C0p5+Zut/KRvIn3bxTodp8EI+uIGuCC1JW/cOZic/BKun7qY3YeKnY4kEhSeXZhFVRVMHJHqdJQGpUEUN8DQji14++4hHDpSzvUvLWZrbpHTkUQCWubeQmYu382tw9rTLi7K6TgNSoMpboD+yc15d9wwyiqruH7qErL2ahFikdM1eW4m0Y3C+MXwzk5HaXAaVHED9EiM4b3xQwkNgRumLWbN7nynI4kEnEWbDvDVxlzuH96F2CitI+lrDa64ATq3jGbm+LNo2iiMm15eyvfbtQixiLuqqixPzN1AUvNIbj2rvdNxGqQGWdwAyS2imHnvMFpGN+LWV5axaNMBpyOJBIQPV2azfk8hE0ak0ihM60g6ocEWN0CbZpG8N34Y7VtEcefr3/Ppei1CLHIqpRUunlmQRZ+kZvykT6LTcRqsBl3cAAnRjXh33FC6t4nm3reX89HqHKcjifitV7/dRk5BqdaRdFiDL26A2KgI3r57CAOSm/Pguyt5P32X05FE/E5eURn/+GILF3dvydCOLZyO06DVee1FY0xj4Gug
Uc32s6y1v/d2MF+LbhzOG3cOZtxb6UyctYaSche3nZXidCwRR8xZmc2U+Vnk5JeQGBvJhBGprNqVT3GFi8cu6+Z0vAbPnYvmlgHDrbVFxphwYJExZp61domXs/lcZEQo029L4xczVvL7f6+juNzFfRfogvDSsMxZmc2k2RmU1FwSOTu/hEc/WEOFq4qxg5Pp3FLrSDqtzqkSW+3oaYbhNV9Buypvo7BQXvzpAK7om8hTn2TyzIIsLUIsDcqU+VnHSvuossoqrIWHLu7iUCo5nlvLVBhjQoHlQGfg79bapbVsMw4YB5CcnFyfGX0uPDSEv9zQj6iIUP76+WaKy138ZrSufCYNQ05+Sa33W6BltNaR9AdufThprXVZa/sBScBgY0yvWraZZq1Ns9amJSQk1HdOnwsNMUy+ujd3nJ3CK4u28fiHGbiqtOctwS8xNrLW+9s0U2n7C4+OKrHW5gNfAiO9ksbPGGP43eU9+PmFnfjnsl088v4qKl1VTscS8aoJI1KJDP/hiTXhoYZHR+pDSX/hzlElCUCFtTbfGBMJXAw85fVkfsIYw4QR3YiKCDs29/fCjf11xpgErTH92wLw9CeZ5BSUEhZieOrqPsfuF+e5s8fdBvjCGLMG+B5YaK392Lux/M/PL+zM73/Sg/nr9jHuzeWUlGsRYgleY/q3Zdx5HQF4+dY0rh6Y5HAiOV6de9zW2jVAfx9k8Xt3nN2BqIhQHpudwe2vLeOV2wfRtJFbn++KBJTPM/fx7MKNnNWpBRekBv5nVsFGZ0566IZByTx3Qz/Sdxzip9OXUlCsRYgleBSWVjBh5mrufD2dxNhInriqt46m8kPaXTwNV/ZrS2R4KL+YsZKxLy/hrbsGE9+0kdOxRM7Iok0HmDhrNXsLS/n5hZ144KIu+izHT2mP+zRd2rM1029LY9uBIm6Yupi9BVqEWALTkbJKfjMng5tfWUrjiFA+uO8sJozoptL2YyruM3Be1wTevHMI+wrLuG7qd+w6qEWIJbAs3ZrHZc9/wztLd3L3OR2Y+8C59E9u7nQsqYOK+wwN7hDHO3cPobCkkuteWswWLUIsAaC0wsWfPl7P2JerLzn03rhh/ObyHjQO1152IFBx14O+7WJ5d9xQKququGHqYjbsKXQ6kshJrdx5iFEvfMMri7Zx85D2zHvwXAZ3iHM6lnhAxV1PureJ4b3xwwgPDWHstCWs2qVFiMW/lFW6ePqTTK75x3eUlrt4+64h/GlML5rokNaAo+KuR50SmvL++GE0iwzn5ulLWbo1z+lIIgCszS7gyr99y4tfbuHagUl88svzOKdLvNOx5DSpuOtZu7go3h8/jFYxjbjttWV8tTHX6UjSgFW4qnj+002M+fu35B0p59Xb03j62r7ENA53OpqcARW3F7Ru1pj3xw+jY3xT7nkjnfnr9jodSRqgjfsOc/WL3/GXTzcyuk8bFv7yPIZ3a+V0LKkHKm4vadG0Ef+8Zyg9EmP42Tsr+NeqbKcjSQPhqrK89NUWLn9hEdn5JfzjpwN4fmx/YqMinI4m9USfSnhRs6hw3r57CHe/8T0PvbeKknIXYwcH9iIT4t+25hbxq5mrWbEzn5E9W/O/V/XSWb1BSMXtZU0bhfH6HYO59+3lPDY7g+JyF3ee08HpWBJkqqosbyzezlOfZBIRGsJzN/Tjyn6Jus5IkFJx+0Dj8FCm3jKQB/+5ij9+vJ6SChc/v7Cz07EkSOw6WMyEWatZsvUgF6Ym8OQ1fWgVo9VqgpmK20cahYXyt5v6M2HWGqbMz+JIWSUTRqRqj0hOm7WWfy7bxf/9Zz3GGJ66pjfXp7XT/6kGQMXtQ2GhITxzXV8iI0J58cstFJe7+N3lPQgJ0TeaeGZPQQmPfpDB1xtzObtzC566pg9JzaOcjiU+ouL2sZAQw/+N6UVkeCivLNpGSbmLJ67uTajKW9xgrWX2imz+8NE6Kl2WP13Zk58Oaa83/wZGxe0AYwy/Gd2dJo3CeOGz
TRRXuHj2+r6Eh+roTDm5/YdLeXz2Wj7dsI9BKc2Zcm1fUuKbOB1LHKDidogxhocv6UpURChPzsukpNzF327qr6uzSa0+XpPDb+es5Ui5i9+M7s4dZ3fQT2kNmIrbYfee34moiFB+96913PNmOlNvGUhUhP5ZpNrBI+X89l9r+c+aPfRNasYz1/elc8top2OJw9QQfuDWYSlEhofy6AdruO3VZbx6+yCidS2JBm/h+n1Mmp1BQUk5E0akMv68joRpOk1QcfuN69LaERkRykPvruKn05fy5p2DdYpyA1VQUsH/fLSO2Suy6d4mhjfvHEyPxBinY4kfUXH7kcv7JBIZHsp976xg7LQlvHXXEBKidbpyQ/LVxlwenbWG3KIy7h/emfuHdyEiTHvZ8kN1/o8wxrQzxnxhjNlgjFlnjHnQF8Eaqou6t+K12wexI6+YG6YuJie/xOlI4gNFZZVMmp3Bba8uo2njMGbfdxaPXJqq0pZaGWvtqTcwpg3Qxlq7whgTDSwHxlhr15/sz6Slpdn09PT6TdrApG8/yB2vfU9MZDgz7hlC+xY/POxrzspspszPIie/hMTYSCaMSGVM/7YOpZUzsXhLHhNmrSY7v4Rx53bkl5d01dFFDZAxZrm1Ns2dbet8O7fW7rHWrqi5fRjYAKghvCwtJY4Z9wzlSHkl109dzOb9h489NmdlNpNmZ5CdX4IFsvNLmDQ7gzkrdenYQFJS7uIP/17HjS8vISzEMHP8MCaN6q7Sljp59HOYMSYF6A8s9UYY+aHeSc14b9wwqixcP3UJ63IKAJgyP4uSCtcPti2pcDFlfpYTMeU0LN9xkFEvfMPr323n9rNSmPvguaSlaMFecY/bxW2MaQp8ADxkrf3RMubGmHHGmHRjTHpurpbrqi+praN5f/wwGoeFcOO0JazYeeik896aD/d/pRUuJs/bwHUvLaa8sooZdw/hD1f01LH74pE657gBjDHhwMfAfGvts3Vtrznu+rf7UDE3T1/K/sNlRIWHcuBI+Y+2aRsbybePDXcgnbgjY3cBD7+/ik37i7hxcDseH9Vdx+vLMfU6x22qrxH5CrDBndIW70hqXr0IcdvYSApKK4g44USMyPBQJoxIdSidnEp5ZRXPLtzImBe/pbC0gtfuGMTkq/uotOW0uTNVcjZwCzDcGLOq5muUl3NJLVrGNOa98cPo2ioal7XERUVgqN7Tnnx1bx1V4ocy9xZy1Yvf8sJnm7iybyILHjqfC1NbOh1LAlydE2vW2kWArmbjJ+KaRDDjnqHc+fr3rNqVzx/H9GJMv0TtvfmZSlcVU7/eynOfbqRZZDhTbxnIiJ6tnY4lQcKtOW5PaY7b+46UVXLPm+l8tyWPEAO92jZjSIc4hnRowaAOcTSLVJE7ZUtuEY+8v5pVu/IZ3bsNf7yyJy20YK/UwZM5bhV3AKt0VbFk60GWbstj6daDrNqVT7mrCmOge+sYhnSsLvIhHeJo3kTXPfG2qirLq99uY8r8LCIjQvnTlb34Sd9Ep2NJgFBxN1ClFS5W7sw/VuQrdh6irLIKgNRW0f8t8o5xxGsPsF7tyDvChJlrWLb9IBd3b8kTV/emZbQW7BX3qbgFgLJKF2t2F7B0ax5Ltx0kffuhYyfudEpowpCO1XvjQzu20Krgp8lay9tLdzJ57gZCjeH3V/TkmgFttWCveEzFLbWqcFWRkV3A0prplfTthygqqwQgpUUUQzu2OLZXnhgb6XBa/5edX8Kjs9awaPMBzu0Sz1PX9NG4yWlTcYtbKl1VrN9TeKzIl207SGFpdZG3i4s8Nj8+tGMLkppHai+yhrWWmct386eP1uOyll+P7s5Ng5M1PnJGVNxyWlxVlsy9PyzyQ8UVACQ2a3xsamVIxxaktIhqkEW1v7CUSbMz+CxzP4M7xPHna/uS3CLK6VgSBFTcUi+qqiyb9hcd+7Bz6bY8DhRVn2rfMrrRcXPkcXRKaBrURW6t5d+rc/jdv9ZRWuFi4shu
3HFWCiFasFfqiYpbvMJay5bcIyzdlld9GOLWPPYfLgMgvmkEgzv896iVri2jg6bU8orK+O2/1jI3Yy/9k2P583V96ZTQ1OlYEmRU3OIT1lq25xUfO2pl6dY8cgpKAWgeFc6glLhje+Xd28QQGoBF/snavfz6wwwOl1by0CVdGHeuFuwV7/CkuHUtSTltxhg6xDehQ3wTxg5OxlrL7kMlLDla5NvyWLB+HwAxjcNqirx6r7xnYoxfF2BBcQV/+GgdH67MpmdiDDPu6Udq62inY4kAKm6pR8YY2sVF0S4uiuvS2gHV1wj/7xz5QT7L3A9A00ZhDGzf/FiR90lqRrifFPkXWft57IM15BWV89DFXfj5hZ39JpsIaKpEfGxfYemxaZWl2w6yeX8RUH1Z2rSU5seOWumT1IxGYb5dwutwaQX/+/EG3kvfRddWTXn2+n70atvMpxmk4dIctwSMA0VlLDuuyDP3Vq+t2SgshAHJ/90j758c69W1GL/dfICJs9awp6CE8ed34qGLu/j8jUMaNhW3BKxDR8pZtv3gscMP1+8pxFqICA2hX7vYY0U+oH1svSz3VVxeyZPzMnlz8Q46xjfhz9f3ZUBy83r4m4h4RsUtQaOgpIL07QePTa+szSnEVWUJCzH0SWp27KiVtJQ4mjbyrMjTtx/kkZmr2ZFXzJ1nd2DCiFQiI7SXLc5QcUvQKiqr/EGRr9ldQGWVJTTE0Csx5gdFfrJrkpdWuHhmQRbTF20jqXkkU67ty9COLXz8NxH5IRW3NBjF5ZWs2JFf6zXJe7SJOXZC0JAOccRGRbB6Vz6PzFzN5v1F3DQkmcdHdfd4T13EG1Tc0mCd6prkXVo2ZeuBI7SMbsRT1/ThvK4JDqcV+S+dgCMNVuPwUIZ1asGwTtVTH8dfk3zZ9kMM6RjHhBHdtLSbBDQVtwS1RmGhDEqJY1BKnNNRROqNTgcTEQkwKm4RkQCj4hYRCTB1znEbY14FLgf2W2t7eT+SSP2bszKbKfOzyMkvITE2kgkjUhnTv63TsUROizt73K8DI72cQ8Rr5qzMZtLsDLLzS7BUL/I7aXYGc1ZmOx1N5LTUWdzW2q+Bgz7IIuIVU+ZnUVLh+sF9JRUupszPciiRyJmptzluY8yI+EJ7AAAFFUlEQVQ4Y0y6MSY9Nze3vp5W5Izl5Jd4dL+Iv6u34rbWTrPWpllr0xISdEaa+I/E2EiP7hfxdzqqRILehBGpRJ5wLe/I8FAmjEh1KJHImdGZkxL0jh49oqNKJFi4czjgP4ELgHhjzG7g99baV7wdTKQ+jenfVkUtQaPO4rbW3uiLICIi4h7NcYuIBBgVt4hIgFFxi4gEGBW3iEiAUXGLiAQYr6w5aYzJBXac5h+PBw7UY5z6olyeUS7PKJdn/DHXmWZqb61167RzrxT3mTDGpLu7YKYvKZdnlMszyuUZf8zly0yaKhERCTAqbhGRAOOPxT3N6QAnoVyeUS7PKJdn/DGXzzL53Ry3iIicmj/ucYuIyCk4VtzGmFeNMfuNMWtP8rgxxrxgjNlsjFljjBngB5kuMMYUGGNW1Xz9ztuZal63nTHmC2PMBmPMOmPMg7Vs48R4uZPL52NmjGlsjFlmjFldk+t/atmmkTHmvZrxWmqMSfGTXLcbY3KPG6+7vZ2r5nVDjTErjTEf1/KYz8fKzVxOjdV2Y0xGzWum1/K4978XrbWOfAHnAQOAtSd5fBQwDzDAUGCpH2S6APjYgbFqAwyouR0NbAR6+MF4uZPL52NWMwZNa26HA0uBoSds8zPgpZrbY4H3/CTX7cDfHPg/9jAwo7Z/KyfGys1cTo3VdiD+FI97/XvRsT1uW/cixFcCb9pqS4BYY0wbhzM5wlq7x1q7oub2YWADcOLFpZ0YL3dy+VzNGBTV/Da85uvED3OuBN6ouT0LuMgYY/wgl88ZY5KA0cD0k2zi87FyM5e/8vr3oj/PcbcFdh33+934QSkAw2p+1J1njOnp6xev+TG1
P9V7a8dzdLxOkQscGLOaH7FXAfuBhdbak46XtbYSKABa+EEugGtqfsSeZYxp5+1MwHPARKDqJI87MlZu5ALfjxVUv9kuMMYsN8aMq+Vxr38v+nNx1/aO7vTeyQqqT0vtC/wVmOPLFzfGNAU+AB6y1hae+HAtf8Qn41VHLkfGzFrrstb2A5KAwcaYXids4sh4uZHrIyDFWtsH+JT/7ul6hTHmcmC/tXb5qTar5T6vjpWbuXw6Vsc521o7ALgM+Lkx5rwTHvf6ePlzce8Gjn8HTQJyHMoCgLW28OiPutbauUC4MSbeF69tjAmnuhzfsdbOrmUTR8arrlxOjlnNa+YDXwIjT3jo2HgZY8KAZvhwmuxkuay1edbasprfvgwM9HKUs4ErjDHbgXeB4caYt0/YxomxqjOXA2N19HVzan7dD3wIDD5hE69/L/pzcf8buLXmE9qhQIG1do+TgYwxrY/O7RljBlM9fnk+eF0DvAJssNY+e5LNfD5e7uRyYsyMMQnGmNia25HAxUDmCZv9G7it5va1wOe25pMlJ3OdMBd6BdWfG3iNtXaStTbJWptC9QePn1trbz5hM5+PlTu5fD1WNa/ZxBgTffQ2cClw4lFoXv9edGyVd1PLIsRUf1iDtfYlYC7Vn85uBoqBO/wg07XAfcaYSqAEGOvt/8A1zgZuATJq5kcBHgeSj8vm8/FyM5cTY9YGeMMYE0r1G8X71tqPjTF/BNKttf+m+g3nLWPMZqr3Hsd6OZO7uR4wxlwBVNbkut0HuX7ED8bKnVxOjFUr4MOafZEwYIa19hNjzL3gu+9FnTkpIhJg/HmqREREaqHiFhEJMCpuEZEAo+IWEQkwKm4RkQCj4hYRCTAqbhGRAKPiFhEJMP8P96gWxowg7S8AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "model = LinearRegression().fit(X2, y)\n",
    "yfit = model.predict(X2)\n",
    "plt.scatter(x, y)\n",
    "plt.plot(x, yfit);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This idea of improving a model not by changing the model, but by transforming the inputs, is fundamental to many of the more powerful machine learning methods.\n",
    "We explore this idea further in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) in the context of *basis function regression*.\n",
    "More generally, this is one motivational path to the powerful set of techniques known as *kernel methods*, which we will explore in [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)."
   ]
  },
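  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the transformation less of a black box, the following minimal sketch builds the cubic feature columns by hand and compares them with the output of ``PolynomialFeatures``.\n",
    "The names ``x_demo``, ``manual``, and ``auto`` are illustrative only and are chosen so that the variables defined earlier are left untouched; the input values are assumed to be the integers 1 through 5, matching the printed matrix above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Illustrative input: one feature with five samples\n",
    "x_demo = np.array([1, 2, 3, 4, 5])\n",
    "\n",
    "# Build the columns x, x^2, x^3 by hand\n",
    "manual = np.column_stack([x_demo, x_demo ** 2, x_demo ** 3]).astype(float)\n",
    "\n",
    "# Let scikit-learn derive the same columns (no constant bias column)\n",
    "auto = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x_demo[:, np.newaxis])\n",
    "\n",
    "# True when both constructions agree element by element\n",
    "print(np.allclose(manual, auto))"
   ]
  },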
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imputation of Missing Data\n",
    "\n",
    "Another common need in feature engineering is the handling of missing data.\n",
    "We discussed the handling of missing data in ``DataFrame``s in [Handling Missing Data](03.04-Missing-Values.ipynb), and saw that ``NaN`` is often used to mark missing values.\n",
    "For example, we might have a dataset that looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import nan\n",
    "X = np.array([[ nan, 0,   3  ],\n",
    "              [ 3,   7,   9  ],\n",
    "              [ 3,   5,   2  ],\n",
    "              [ 4,   nan, 6  ],\n",
    "              [ 8,   8,   1  ]])\n",
    "y = np.array([14, 16, -1,  8, -5])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When applying a typical machine learning model to such data, we first need to replace the missing entries with some appropriate fill value.\n",
    "This is known as *imputation* of missing values, and strategies range from simple (e.g., replacing missing values with the mean of the column) to sophisticated (e.g., using matrix completion or a robust model to handle such data).\n",
    "\n",
    "The sophisticated approaches tend to be very application-specific, and we won't dive into them here.\n",
    "For a baseline imputation approach using the mean, median, or most frequent value, Scikit-Learn provides the ``SimpleImputer`` class in ``sklearn.impute`` (in releases before 0.20 this role was filled by ``sklearn.preprocessing.Imputer``, which has since been removed):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[4.5, 0. , 3. ],\n",
       "       [3. , 7. , 9. ],\n",
       "       [3. , 5. , 2. ],\n",
       "       [4. , 5. , 6. ],\n",
       "       [8. , 8. , 1. ]])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer\n",
    "imp = SimpleImputer(strategy='mean')\n",
    "X2 = imp.fit_transform(X)\n",
    "X2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in the resulting data, each of the two missing values has been replaced with the mean of the remaining values in its column. This imputed data can then be fed directly into, for example, a ``LinearRegression`` estimator:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = LinearRegression().fit(X2, y)\n",
    "model.predict(X2)"
   ]
  },
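  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on the imputation step, the sketch below recomputes the per-column fill values with ``np.nanmean`` and compares them with what a fitted imputer stores in its ``statistics_`` attribute; for this data the column means of the first two columns are 4.5 and 5, the values that appear in the imputed matrix above.\n",
    "It assumes scikit-learn 0.20 or later, where the class is called ``SimpleImputer``. The names ``X_demo`` and ``imp_mean`` are illustrative, and the last line swaps in the median strategy only to show that the workflow stays the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer  # assumes scikit-learn 0.20 or later\n",
    "\n",
    "# Same toy matrix as above, under a new name so earlier variables stay untouched\n",
    "X_demo = np.array([[np.nan, 0, 3],\n",
    "                   [3, 7, 9],\n",
    "                   [3, 5, 2],\n",
    "                   [4, np.nan, 6],\n",
    "                   [8, 8, 1]])\n",
    "\n",
    "# Column means that ignore NaN entries; these are the fill values for strategy='mean'\n",
    "print(np.nanmean(X_demo, axis=0))\n",
    "\n",
    "# The fitted imputer stores the same per-column fill values\n",
    "imp_mean = SimpleImputer(strategy='mean').fit(X_demo)\n",
    "print(np.allclose(imp_mean.statistics_, np.nanmean(X_demo, axis=0)))\n",
    "\n",
    "# Changing the strategy changes only the fill values, not the workflow\n",
    "print(SimpleImputer(strategy='median').fit_transform(X_demo))"
   ]
  },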
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Pipelines\n",
    "\n",
    "With any of the preceding examples, it can quickly become tedious to do the transformations by hand, especially if you wish to string together multiple steps.\n",
    "For example, we might want a processing pipeline that looks something like this:\n",
    "\n",
    "1. Impute missing values using the mean\n",
    "2. Transform features to quadratic\n",
    "3. Fit a linear regression\n",
    "\n",
    "To streamline this type of processing pipeline, Scikit-Learn provides a ``Pipeline`` object, which can be used as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.impute import SimpleImputer  # replaces the removed Imputer class\n",
    "\n",
    "model = make_pipeline(SimpleImputer(strategy='mean'),\n",
    "                      PolynomialFeatures(degree=2),\n",
    "                      LinearRegression())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This pipeline looks and acts like a standard Scikit-Learn object, and will apply all the specified steps to any input data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[14 16 -1  8 -5]\n",
      "[ 14.  16.  -1.   8.  -5.]\n"
     ]
    }
   ],
   "source": [
    "model.fit(X, y)  # X with missing values, from above\n",
    "print(y)\n",
    "print(model.predict(X))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All the steps of the model are applied automatically.\n",
    "Notice that, to keep this demonstration simple, we've applied the model to the data it was trained on; this is why it was able to predict the result perfectly (refer back to [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) for further discussion of this).\n",
    "\n",
    "For some examples of Scikit-Learn pipelines in action, see the following section on naive Bayes classification, as well as [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) and [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)."
   ]
  },
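  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to see why the training-set predictions above can be exact is to look inside the pipeline. The sketch below, which assumes scikit-learn 0.20 or later, rebuilds the same three-step pipeline under the illustrative names ``X_demo``, ``y_demo``, and ``pipe_demo``, lists the automatically generated step names, and then runs only the two transformers to inspect the matrix that reaches the regression.\n",
    "With three input features, the quadratic expansion yields ten columns for only five samples, so the linear model has more free coefficients than data points and can be expected to reproduce the training targets; this says nothing about how it would behave on new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer  # assumes scikit-learn 0.20 or later\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "# Same toy data as above, copied under new names so earlier variables stay untouched\n",
    "X_demo = np.array([[np.nan, 0, 3],\n",
    "                   [3, 7, 9],\n",
    "                   [3, 5, 2],\n",
    "                   [4, np.nan, 6],\n",
    "                   [8, 8, 1]])\n",
    "y_demo = np.array([14, 16, -1, 8, -5])\n",
    "\n",
    "pipe_demo = make_pipeline(SimpleImputer(strategy='mean'),\n",
    "                          PolynomialFeatures(degree=2),\n",
    "                          LinearRegression())\n",
    "pipe_demo.fit(X_demo, y_demo)\n",
    "\n",
    "# make_pipeline names each step after its lowercased class name\n",
    "print([name for name, _ in pipe_demo.steps])\n",
    "\n",
    "# Apply only the two fitted transformers to see what the regression receives\n",
    "imputed = pipe_demo.named_steps['simpleimputer'].transform(X_demo)\n",
    "expanded = pipe_demo.named_steps['polynomialfeatures'].transform(imputed)\n",
    "\n",
    "# (n_samples, n_polynomial_columns): more columns than samples\n",
    "print(expanded.shape)"
   ]
  },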
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--NAVIGATION-->\n",
    "< [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb) | [Contents](Index.ipynb) | [In Depth: Naive Bayes Classification](05.05-Naive-Bayes.ipynb) >"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
